Here’s the link to the repo on github: https://github.com/oj-lappi/IODS-project
I’d like to learn about statistical tests and possibly how to add R to data pipelines.
I don’t think R will replace my current tools for analyzing and plotting data, but I do want to at least see how feasible it would be to add R as a tool for developing statistical analyses for experiments, and possibly writing some small programs for quick statistical analysis in postprocessing of experiments.
I’m familiar with git (it’s a daily driver for me). I tend to use gnuplot and python for quick plotting, and either python, UNIX tools, or custom C++ programs for data analysis utilities.
I’m interested in creating automatic data pipelines that process data in object storage and upload plots to a dashboard. R+Github pages might be a good way to do that, although I would prefer an on-prem solution (HU’s gitlab maybe?) or a self-hosted website.
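Something like the small postprocessing helpers mentioned above might look like this; a minimal sketch, where the function name and inputs are hypothetical, not from any real pipeline:

```r
# Hypothetical helper for a batch postprocessing step: summarize a
# numeric measurement vector into stats that could be written out
# to a dashboard or log.
summarize_metric <- function(x) {
  list(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
}

# In a pipeline this could be called from an `Rscript` entry point;
# here it just runs on a toy vector.
stats <- summarize_metric(c(10, 12, 13, 15, 20))
stats$median
## [1] 13
```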
I have ssh keys set up on each of my computers, and a personal access token would be less flexible than that.
I can push and pull with this setup from RStudio.
I checked the exercises and followed some of them along locally, I think I’m getting the hang of it :).
# Timestamp:
date()
## [1] "Tue Nov 28 01:27:01 2023"
# Dependencies
#install.packages(c("readr","lmtest", "dplyr"))
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Timestamp
date()
## [1] "Tue Nov 28 01:27:01 2023"
lrn2014 <- read_csv('data/learning2014.csv')
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] "dimensions: 166" "dimensions: 7"
There are 166 rows with 7 columns each.
The spec is:
## cols(
## gender = col_character(),
## age = col_double(),
## attitude = col_double(),
## deep = col_double(),
## stra = col_double(),
## surf = col_double(),
## points = col_double()
## )
gender seems to be a categorical value, so let’s see the
number of rows per gender (sex):
## # A tibble: 2 × 2
## gender count
## <chr> <int>
## 1 F 110
## 2 M 56
110 F and 56 M: there is a skew towards female students in the dataset. Let’s plot that.
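The per-gender tally above was presumably produced with `dplyr::count()`; base R’s `table()` gives the same counts (toy vector reconstructed from the figures in the text):

```r
# Reconstructed toy vector with the same counts as the dataset
gender <- c(rep("F", 110), rep("M", 56))
table(gender)
## gender
##   F   M
## 110  56
```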
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
fig. 1.1, sex distributions
Let’s plot the age distribution of the students.
lrn2014 %>%
ggplot(aes(x = age)) +
stat_count()
fig. 1.2, age distributions
min(lrn2014$age)
## [1] 17
max(lrn2014$age)
## [1] 55
median(lrn2014$age)
## [1] 22
The age range is from 17 to 55, and the median is 22. Visually inspecting the distribution, the mode of the distribution is early twenties, as you would expect, although there is a long tail.
Let’s combine the two columns into a classic population pyramid, or age-sex pyramid.
But that’s not exactly what we want. It turns out a population pyramid is not an out-of-the-box plot we can easily produce, we have to manually calculate bins and do some magic.
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
group_by(gender, age_bin) %>% # Group by bin and gender
summarize(count =n()) %>% # Sum over groups
mutate(count =
if_else(gender == 'M', -count, count)) %>% # Turn one negative
ggplot(aes(x=count, y = age_bin)) +
geom_col(aes(fill=gender))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3a, population pyramid
There are very few male students under 20; I would speculate that this is due to Finnish army conscription. Otherwise the distribution seems roughly equal on the female and male sides.
We can of course bin by single years instead of 5-year groups, which gives higher resolution but more noise.
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3b, population pyramid #2, fine-grained bins
There is one peculiar dip in female student participation around ages 26-28, which recovers after thirty. This might be a maternity effect, but that is highly speculative; there are very few samples in this dataset.
Let’s look at exam scores:
paste("median:", median(lrn2014$points), ", mean:",mean(lrn2014$points), ", standard deviation:", sd(lrn2014$points))
## [1] "median: 23 , mean: 22.7168674698795 , standard deviation: 5.89488364245529"
Let’s look at the full distribution using geom_density.
lrn2014 %>%
ggplot(aes(x=points)) +
geom_density()
fig. 1.4a, exam score distribution
There’s a curious valley in the density at around 12-14 points. Let’s look closer.
lrn2014 %>%
ggplot(aes(x=points)) +
geom_histogram(boundary=0,binwidth=1) +
scale_x_continuous(breaks = seq(0, 40, by = 1))
fig. 1.4b, exam score histogram
So no students got 12, 13, or 14 points; the jump from 11 to 15 suggests some grading threshold. Let’s look at our two demographic variables.
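A quick numeric check of the gap: which integer scores inside the observed range never occur? (A toy vector stands in for `lrn2014$points` here, since this is just to show the idiom.)

```r
# Toy stand-in for the points column
points <- c(7, 9, 10, 11, 15, 16, 17, 19, 21)
# Integers in the observed range that never occur
setdiff(seq(min(points), max(points)), unique(points))
## [1]  8 12 13 14 18 20
```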
lrn2014 %>%
group_by(gender) %>%
ggplot(aes(x=points)) +
geom_histogram(boundary=0,binwidth=2, aes(fill=factor(age))) +
facet_wrap(~gender)
fig. 1.5a, exam scores by sex, with age gradient
I see no clear bias either way, perhaps a slight correlation with score and age within the mode (20-30 years) and then no correlation for higher ages. The female distribution seems most well-behaved, although the gap from 11 points up is much sharper here as well.
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
group_by(gender, age_bin) %>%
ggplot(aes(x=points)) +
geom_histogram(boundary=0,binwidth=2, aes(fill=gender)) +
facet_wrap(~age_bin)
fig. 1.5b, exam scores by age group, with labeled sex
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
ggplot(aes(x=age_bin, y = points)) +
stat_boxplot(geom = "errorbar", width = 0.25) +
geom_boxplot()
fig. 1.6a, exam score distributions by age group, box plot sequence
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
ggplot(aes(x=age_bin, y = points)) +
stat_boxplot(geom = "errorbar", width = 0.25) +
geom_boxplot()+
facet_wrap(~gender)
fig. 1.6b, exam score distributions by age-sex group, box plot sequence
I see no correlation between age and exam score in these plots.
Finally, let us look at all the survey question scores together with the already explored variables in a correlation matrix.
lrn2014 %>%
select(gender, age, surf, stra, deep, attitude, points) %>%
ggpairs(aes(fill=gender, color=gender),lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 1.7, correlation matrix of all variables
There seem to be negative correlations between:

- surf and deep (but seemingly only strongly for male students)
- attitude and deep (weak, but stronger for male students again)
- stra and surf (weak)

And a positive correlation between points and attitude! This is the strongest linear relationship in the data. And we can verify that age does not seem to have an effect at all.

There also seems to be a relationship between attitude and gender, but no relationship between gender and points.

Based on the data exploration, attitude seems the most likely candidate, and next after that: stra and surf.
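The relationships eyeballed above can also be quantified numerically with base R’s `cor()`; a small sketch with made-up columns, since this is just to show the call:

```r
# Toy columns mimicking the kinds of relationships described above
d <- data.frame(attitude = c(1, 2, 3, 4, 5),
                points   = c(12, 15, 19, 24, 25),  # increases with attitude
                surf     = c(5, 4, 4, 2, 1))       # decreases with attitude
cm <- cor(d)
cm["attitude", "points"]   # strong positive
cm["attitude", "surf"]     # strong negative
```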
Let’s start with a simple regression model of points ~ attitude, which will be our baseline.
attitude_model <- lrn2014 %>%
lm(points ~ attitude, data = .)
summary(attitude_model)
##
## Call:
## lm(formula = points ~ attitude, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
There is clearly a statistically significant relationship between attitude and points. But the adjusted R-squared is only around 18.6%, so there is a lot of variance not explained by the model.
three_var_model <- lrn2014 %>%
lm(points ~ attitude + stra + surf, data = .)
summary(three_var_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The adjusted R-squared is higher, which means our model is capturing more of the underlying variation than before, although still below 20%.
It seems that the relationship between points and surf is not statistically significant. Let’s drop surf and keep stra, and try again.
two_var_model <- lrn2014 %>%
lm(points ~ attitude + stra, data = .)
summary(two_var_model)
##
## Call:
## lm(formula = points ~ attitude + stra, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Right, this is our best fit yet, based on the adjusted R-squared. Depending on our choice of significance level in a hypothesis test, the stra term might be ignored. At a standard level of a = 0.05, we couldn’t reject the null hypothesis that stra has no linear effect on points.
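The kind of significance check above can also be scripted by pulling the p-values out of the summary’s coefficient table; a sketch on a synthetic fit, since the real data frame isn’t reloaded here (variable names are made up):

```r
# Synthetic data: y depends on x1 only
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 2 * x1 + rnorm(50)
fit <- lm(y ~ x1 + x2)
# The Wald-test p-values live in the coefficient table of summary()
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
pvals[["x1"]] < 0.05   # x1 comes out clearly significant
## [1] TRUE
```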
Let’s test another model, with nothing but stra.
strastra_model <- lrn2014 %>%
lm(points ~ stra, data = .)
summary(stra_model)
##
## Call:
## lm(formula = points ~ stra, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5581 -3.8198 0.1042 4.3024 10.1394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.233 1.897 10.141 <2e-16 ***
## stra 1.116 0.590 1.892 0.0603 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared: 0.02135, Adjusted R-squared: 0.01538
## F-statistic: 3.578 on 1 and 164 DF, p-value: 0.06031
We get very close to a statistically significant result at a standard
significance level of 0.05, but not quite. Let’s drop
stra, since that’s what the assignment asked us to do.
The model we will be using is therefore the baseline model with one
predictor: attitude.
Let’s rerun that summary:
summary(attitude_model)
##
## Call:
## lm(formula = points ~ attitude, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The fitted regression coefficients are:

- intercept: 11.6372
- attitude: 3.5255

Which means that the model predicts the conditional mean of the exam scores, given an attitude value, as:

points = 11.6372 + 3.5255*attitude

If we multiply the attitude coefficient by the range of attitude in the population, we can get an idea of how the model assigns expected exam scores based on a student’s attitude.
am <- mean(lrn2014$attitude)
as <- sd(lrn2014$attitude)
print("Range of predictor term within a sd:")
## [1] "Range of predictor term within a sd:"
3.5255*c(am-as, am, am+as)
## [1] 8.506549 11.079839 13.653130
print("Range of ŷ_i within a sd:")
## [1] "Range of ŷ_i within a sd:"
11.6372 + 3.5255*c(am-as, am, am+as)
## [1] 20.14375 22.71704 25.29033
print("Range of ŷ_i within two sd's:")
## [1] "Range of ŷ_i within two sd's:"
11.6372 + 3.5255*c(am-2*as, am, am+2*as)
## [1] 17.57046 22.71704 27.86362
So, assuming attitude is Gaussian and that the sample stddev is a good estimate, the model assigns an exam score:

- between 20.14375 and 25.29033 to a majority of students (about 68%: each stddev on either side of the mean captures 34.1% of a Gaussian population)
- between 17.57046 and 27.86362 to a super-majority (95%) of students
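The manual ±sd ranges above approximate what `predict()` reports directly with `interval = "prediction"`; a sketch on a synthetic fit with made-up coefficients close to the ones in the text:

```r
# Synthetic data roughly matching the fitted relationship
set.seed(42)
attitude <- runif(100, 1, 5)
points   <- 11.6 + 3.5 * attitude + rnorm(100, sd = 5)
fit <- lm(points ~ attitude)
# 95% prediction intervals for a few attitude values
pr <- predict(fit, newdata = data.frame(attitude = c(2, 3, 4)),
              interval = "prediction", level = 0.95)
pr
```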
If we look back at the exam score distribution in figure 1.4, this does capture the mode of the distribution.
The multiple R-squared value is 0.1906; the standard way to express this is that “19% of the variation in exam scores is explained by the attitude variable” (see MABS4IODS, 3.2.1).
I would interpret this to mean that, using the linear model with attitude as the predictor, the expected squared error of the prediction is roughly 80% of what it would be when simply using the sample mean as a predictor. The estimate assumes the sample is representative, because we’re using residuals to get this number.
I’m not quite sure if this more elaborate interpretation is exact, but it’s what I was able to piece together from sources online, mostly wikipedia (https://en.wikipedia.org/wiki/Coefficient_of_determination).
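One way to check the “proportion of variance explained” reading: for ordinary least squares with an intercept, multiple R-squared equals 1 - SS_res/SS_tot, the reduction in squared error relative to predicting with the sample mean. Synthetic data again, since the original frame isn’t reloaded:

```r
# Synthetic linear data
set.seed(7)
x <- rnorm(80)
y <- 1 + 2 * x + rnorm(80)
fit <- lm(y ~ x)
# R-squared as proportional reduction in squared error
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
all.equal(summary(fit)$r.squared, 1 - ss_res / ss_tot)
## [1] TRUE
```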
par(mfrow=c(2,2))
plot(attitude_model)
fig. 2.0, regression diagnostics
par(mfrow=c(1,1))
The assumptions of the model are:

1. the constant variance assumption (homoscedasticity)
2. the normality assumption (the residuals are normally distributed)
3. there’s a linear relationship between the predictor and the response
The Residuals vs. Leverage plot checks that there aren’t any influential outliers affecting the fit of the regression coefficients. The plot has a dashed line showing a critical Cook’s distance value; in our case this dashed line is not visible. Essentially, if a point is very far right and has a high standardized residual (off the central line), it’s a highly influential point and will have to be looked into.
A highly influential point may be an indication of a point that should be excluded, but this has to be done on a case-by-case basis. It might also mean that assumption 3 is violated, that there isn’t a linear relationship between predictors and the response.
The Residuals vs. Fitted plot (and the Scale-Location plot) gives us a visual way to check for heteroscedasticity (violation of assumption 1). If the red mean line is not horizontal, it means the residuals have a bias in some region of the response distribution (the variance is not constant).
I don’t think this means the data is heteroscedastic, it certainly
doesn’t look like it is. But I’m not so familiar with these visual
checks, so I searched the web for a homoscedasticity test, and found the
Breusch-Pagan Test in the lmtest library.
attitude_model_bptest <- bptest(attitude_model)
attitude_model_bptest
##
## studentized Breusch-Pagan test
##
## data: attitude_model
## BP = 0.0039163, df = 1, p-value = 0.9501
A p-value of 0.95 means there is very little evidence that the model is heteroscedastic. Ok, good! The trend in the mean line would have to be much clearer than that to indicate a problem.
The Q-Q plot compares the “ideal” theoretical quantiles to the sample quantiles. This is used to check for normality by comparing the residual distribution against a theoretical normal distribution.
There are some outliers at the bottom left, which may indicate a bimodal distribution (remember that the exam scores looked bimodal as well, fig 1.4). The plot also curves down a little at the upper end, which I believe usually indicates a light left skew.
Let’s test the residuals for normality, using the Shapiro-Wilk test:
shapiro.test(attitude_model[["residuals"]])
##
## Shapiro-Wilk normality test
##
## data: attitude_model[["residuals"]]
## W = 0.97394, p-value = 0.003195
The p-value is quite small, 0.003, so we have to reject the null hypothesis that the residuals are normally distributed.
Hmm, maybe the issue is the grading scale? I’ve tried to fix this, and I can get the Q-Q plot to look nicer with some transformations, but it makes the model test statistics worse. This is the best I can do for now.
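For reference, the kind of transformation experiment described above looks something like this: refit on a transformed response and re-run the Shapiro-Wilk test on the new residuals. Purely illustrative, with synthetic multiplicative-noise data for which the log model is correct by construction:

```r
# Synthetic data with multiplicative noise: log(y) is linear in x
set.seed(3)
x <- runif(60, 1, 5)
y <- exp(0.3 * x + rnorm(60, sd = 0.4))
log_fit <- lm(log(y) ~ x)
# Residuals of the log model are normal by construction,
# so the test should usually not reject
shapiro.test(residuals(log_fit))$p.value
```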
library(dplyr)
library(readr)
data_url <- "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"
df <- read_csv(data_url)
## `curl` package not installed, falling back to using `url()`
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The variables in the data set are:
colnames(df)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset contains student school performance data averaged over two courses, Portuguese and mathematics (the variables are: G1, G2, G3, absences, failures, paid). To be exact, the “paid” column is only from one of the courses. The other variables are demographic and social; the social variables were collected using surveys.
Two variables, alc_use and high_use, are transformations of the original dataset added for this assignment. alc_use is the average alcohol use per day, combining self-reported weekday and weekend usage on a scale of 1 to 5. high_use is a boolean indicating whether alc_use is greater than 2.
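The derived columns described above can be reconstructed from the weekday (Dalc) and weekend (Walc) scores; toy rows here, assuming alc_use is the plain mean of the two, as the text suggests:

```r
# Toy rows with made-up Dalc/Walc scores
toy <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2   # average use
toy$high_use <- toy$alc_use > 2             # boolean cutoff
toy$high_use
## [1] FALSE  TRUE  TRUE
```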
I choose:
My hypotheses are:
Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)
library(ggplot2)
library(GGally)
df %>% select(alc_use, high_use, Fedu, Medu, absences, G3) %>%
ggpairs(aes(fill=high_use, color=high_use))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
fig. 1.1, correlation matrix of variables, colored by high_use value
Two out of four hypotheses seem to have some support in the data after this exploration.
Let’s do a logistic regression model using the two statistically significant correlations: G3 and absences.
model <- glm(high_use ~ G3 + absences, data = df, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ G3 + absences, family = "binomial",
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38732 0.43500 -0.890 0.373259
## G3 -0.07606 0.03553 -2.141 0.032311 *
## absences 0.08423 0.02309 3.648 0.000264 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 429.93 on 367 degrees of freedom
## AIC: 435.93
##
## Number of Fisher Scoring iterations: 4
Both G3 and absences have a p-value of less than 0.05 for the Wald tests of the fit, but we have a better test that’s easier to interpret: the confidence intervals from the model.
coeffs <- coefficients(model)
odds_ratio <- coeffs %>% exp
odds_ratio
## (Intercept) G3 absences
## 0.6788751 0.9267616 1.0878762
The odds ratios for G3 and absences are roughly 0.927 and 1.088. For G3, the OR is less than one because the correlation between the variables is negative.
The way to interpret these is:

- for each one-point decrease in final grade average, the odds increase by roughly 8.0% that a student has high alcohol use (1/0.9267 ≈ 1.08)
- for each additional absence per course, the odds increase by roughly 8.8% that a student has high alcohol use
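A sanity check on the odds-ratio reading: for a single binary predictor, exp(coef) from a logistic fit matches the classic 2x2-table odds ratio. Made-up counts for illustration:

```r
# 40 "exposed" (e.g. many absences), 60 not
exposed <- c(rep(1, 40), rep(0, 60))
# Outcome among exposed: odds 20/20 = 1; among unexposed: odds 10/50 = 0.2
outcome <- c(rep(1, 20), rep(0, 20),
             rep(1, 10), rep(0, 50))
fit <- glm(outcome ~ exposed, family = "binomial")
or_table <- (20 / 20) / (10 / 50)   # odds ratio from counts = 5
all.equal(unname(exp(coef(fit)["exposed"])), or_table, tolerance = 1e-6)
## [1] TRUE
```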
But that’s just the average effect the model has fit; let’s look at the confidence intervals of the odds ratios.
ci <- confint(model) %>% exp
## Waiting for profiling to be done...
ci
## 2.5 % 97.5 %
## (Intercept) 0.2857593 1.5832545
## G3 0.8639686 0.9935498
## absences 1.0419618 1.1407755
The 95% confidence intervals are both on one side of unity, so we can say with 95% confidence that there is an effect for both variables, and the effect increases the odds by:

- a factor in the range [1.007, 1.16] for each decrease in final grade average (again, inverting the odds ratio because the effect is negative)
- a factor in the range [1.04, 1.14] for each increase in absences per course
There is an effect, although the final grade average effect gets damn near 1 at the low end of the confidence interval: the increase in odds per unit is only 0.7%! For absences, the confidence interval is a little tighter, but it still seems a little wide to me for practical use. We would need more samples to get a tighter interval.
The two hypotheses that survived the initial exploration both match the outcomes of the logistic regression, and now we can quantify the quality of an estimate of the effect.
Let’s test out the model with a confusion matrix.
probabilities <- predict(model, type = "response")
df <- df %>%
mutate(probability = probabilities) %>%
mutate(prediction = probability > 0.5)
table(high_use = df$high_use, prediction = df$prediction) %>%
prop.table %>%
addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68108108 0.01891892 0.70000000
## TRUE 0.25675676 0.04324324 0.30000000
## Sum 0.93783784 0.06216216 1.00000000
The confusion matrix tells us that the model has a false positive rate of ~2% and a false negative rate of ~26%. That’s pretty good! High false negative rates are not so bad; they are just missed cases. High false positives would mean that the model is unreliable and cannot be used as an indicator (in this case; the importance of different error types depends on the specific use case and the meaning of negative and positive).
On the other hand, the confusion matrix also tells us that the model predicts 94% of all students to have non-high alcohol use, while in reality the number is 70%. So the model achieves this relatively safe indicator status by being rather conservative.
We haven’t done a split into a model fit dataset and a validation set, so this confusion matrix is of limited value.
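The caveat above could be addressed with a simple hold-out split; a base-R sketch (80/20, with 370 being the row count of the alc data):

```r
# Hold out 20% of 370 rows for validation
set.seed(123)
n        <- 370
test_idx <- sample(n, size = round(0.2 * n))
length(test_idx)
## [1] 74
# The model would then be fit on df[-test_idx, ] and the confusion
# matrix computed on df[test_idx, ] instead of on the training data.
```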
This assignment is about classification and clustering. We’re looking at a dataset put together from census data, and seeing how crime rate varies across multiple correlated variables.
The methods used are linear discriminant analysis (LDA) and k-means clustering.
Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here (https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html). (0-1 points)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
There are 506 rows with 14 columns. The dataset seems to originally have been put together to analyze housing values across the suburbs. From the paper cited on the stat.ethz.ch site: “Harrison, D. and Rubinfeld, D.L. (1978) Hedonic housing prices and the demand for clean air”, we can find that a row represents a “SMSA census tract”, so, a census area in Boston containing some number of housing units.
The columns contain social statistics related to these census areas
(e.g. crim = crime rate, ptratio =
pupil-teacher ratio), data about the housing units in the area
(rm = avg # of rooms per unit, medv = median
housing unit value, age = prop. houses built before 1940),
and data about the location of the area (e.g. dis =
weighted mean of distances to employment centers, chas = 1
if by Charles River, 0 if not by Charles River).
Some of the columns are a little counter-intuitive or difficult to
interpret. E.g., the column age is the proportion of houses
built before 1940, and the column lstat is the proportion
of the population that is lower status. From the Harrison &
Rubinfeld paper, lower status means: “1/2 * (proportion of adults
without some high school education and proportion of male workers
classified as laborers)”.
Ok, before we move forward, we did see a small issue here; let’s change chas from an integer to a boolean.
library(dplyr)
Boston_explore <- Boston %>% mutate(chas = chas == 1)
str(Boston_explore)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
summary(Boston_explore)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Mode :logical
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 FALSE:471
## Median : 0.25651 Median : 0.00 Median : 9.69 TRUE :35
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Looking at this summary, all columns are definitely not created
equal. I am mostly looking at the difference between the mean and
median, which — if they differ by much — can indicate that a variable is
not normally distributed. Some of the columns are close to normal
distributions, but e.g. zn has a median of 0 and a mean of
11.36. Other highly skewed columns are rad,
crim, tax, chas, and
black.
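The mean-vs-median comparison above can be done in one sweep with `sapply()`; a toy frame with one symmetric and one skewed column, just to show the idiom:

```r
# One symmetric column, one heavily right-skewed column
d <- data.frame(sym = c(1, 2, 3, 4, 5), skew = c(0, 0, 0, 1, 99))
# Mean minus median per column: far from 0 suggests skew
sapply(d, function(col) mean(col) - median(col))
##  sym skew
##    0   20
```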
library(ggplot2)
library(GGally)
Boston_explore %>%
ggpairs(lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 3.1, Correlation matrix
Ok, lots to unpack here. Let’s start with a visual check of each
variable’s distribution (the diagonal). Almost none of the columns look
normally distributed, with perhaps the exception of rm, the
number of rooms.
There are lots of interesting correlations. Just looking at the
scatter plots, the three variables rm, medv, and
lstat seem to have strikingly strong pairwise relationships,
which makes sense to me.
The rad variable, which is an “index of accessibility to
radial highways”, clearly has a bimodal, or one could even say split,
distribution. A subset of areas has a much higher index than the
others, and in the scatter plots, this clearly visible line of the
higher-index population seems to consistently cover different ranges of
the other variable than the lower-index population does. The effect is most
clearly noticeable in the crim, tax,
nox, dis and black scatter
plots.
dis and nox also have a strikingly
logarithmic-looking relationship.
In general, nearly every variable seems to be correlated with every
other variable, excepting the chas (area is by the Charles
river) column.
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Boston_explore), method="circle")
fig. 3.2, Correlation matrix with corrplot
Using corrplot, we lose some information, but get a better overview of where the correlations are strongest.
We see strong correlations (large balls) between:

- dis and (zn, indus, nox, and age)
- tax and rad (a very strong correlation; they seem to capture much of the same variation)
- tax and (crim, indus, and nox), and likewise for rad
- lstat and medv
Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
Let’s run the scale function on the dataset.
Boston_scaled <- as.data.frame(scale(Boston_explore))
summary(Boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We’ve now standardized the columns by subtracting each column’s mean and dividing by its standard deviation, so that a variable that was normally distributed would now follow the standard normal distribution.
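As a minimal sketch of what scale() does to each column:

```r
# Sketch: scale() subtracts the column mean and divides by the column sd
x <- c(2, 4, 6, 8)
manual    <- (x - mean(x)) / sd(x)
via_scale <- as.numeric(scale(x))

all.equal(manual, via_scale)   # the two agree
mean(via_scale)                # ~0 after scaling
sd(via_scale)                  # 1 after scaling
```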
# Create quartile class from the scaled crime rate
Boston_scaled$crime <- cut(Boston_scaled$crim,
                           breaks = quantile(Boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
# Drop crim
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
We’ve turned the crime rate column into a categorical variable indicating which quartile of the crime rate distribution each census area falls into.
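A tiny sketch of how cut() with quantile() breaks behaves, on a toy vector:

```r
# Sketch: cut() with quantile() breaks assigns each value to its quartile
v <- 1:8
q <- cut(v, breaks = quantile(v), include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(q)   # two values fall in each quartile class
```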
set.seed(179693716)
n <- nrow(Boston_scaled)
split <- sample(n, size = n * 0.8)
train <- Boston_scaled[split,]
test <- Boston_scaled[-split,]
Now we’ve split the dataset into two: 80% of rows are in the training set, 20% are in the test set.
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2475248 0.2351485 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 0.85485581 -0.8956178 -0.08120770 -0.8436157 0.4413548 -0.8603410
## med_low -0.09212737 -0.2763238 0.04263895 -0.5633101 -0.1877614 -0.3785908
## med_high -0.38704309 0.1791469 0.26643202 0.3710782 0.1054813 0.4474274
## high -0.48724019 1.0170298 -0.04947434 1.0820880 -0.4230980 0.8279971
## dis rad tax ptratio black lstat
## low 0.8290333 -0.6897369 -0.7642825 -0.43374243 0.3887547 -0.77819536
## med_low 0.3902633 -0.5443053 -0.4238660 -0.08385135 0.3219258 -0.11887074
## med_high -0.3933098 -0.4088466 -0.3025121 -0.20555126 0.1295279 0.06516648
## high -0.8564037 1.6390172 1.5146914 0.78181164 -0.8395683 0.90684062
## medv
## low 0.52171440
## med_low -0.04934231
## med_high 0.14052867
## high -0.73164690
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12991855 0.717286290 -1.11409479
## indus 0.01318562 -0.275547338 0.12479393
## chas -0.08510193 -0.049959856 0.17108503
## nox 0.50033989 -0.751262885 -1.27531579
## rm -0.10015731 -0.108297341 -0.19190232
## age 0.21057791 -0.358696692 -0.14782657
## dis -0.05526417 -0.347261139 0.38293342
## rad 3.16593191 1.032967549 -0.36643042
## tax -0.01168751 -0.096843332 1.10072573
## ptratio 0.12524469 0.004459516 -0.41009220
## black -0.12747580 -0.014860435 0.07573253
## lstat 0.22137355 -0.316638242 0.31488347
## medv 0.17976929 -0.444140572 -0.22238434
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9509 0.0340 0.0151
# Helper for drawing variable arrows on the LDA biplot
arrows <- function(x, scale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0,
                   x1 = scale * heads[, choices[1]],
                   y1 = scale * heads[, choices[2]],
                   col = color, length = arrow_heads)
  text(scale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
arrows(lda.fit, scale = 1.5, color = "#ee8855")
fig. 5.1 LDA biplot
We see that of the first two linear discriminants, LD1 nearly
perfectly separates the data into two clusters: those with a high crime
rate, and the rest. rad clearly has the
highest coefficient in LD1, which can be seen both from the biplot and
the LDA summary.
LD2 seems to find another axis within the data that explains a
smaller effect. The largest coefficients in LD2 belong to
nox, medv, and zn.
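Reading off the largest coefficients can also be done programmatically. A sketch, illustrated on the built-in iris data rather than the Boston fit above so that it is self-contained:

```r
# Sketch: rank predictors by absolute coefficient within each discriminant,
# illustrated with MASS::lda on the built-in iris data
library(MASS)

fit <- lda(Species ~ ., data = iris)
co  <- coef(fit)    # one column of coefficients per linear discriminant
ranked <- apply(co, 2, function(v) names(sort(abs(v), decreasing = TRUE)))
ranked              # most influential variable per LD at the top
```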
# Drop the result variables
facit <- test$crime
test <- dplyr::select(test, -crime)
Let’s predict the crime rate quartiles in the test set and cross tabulate:
# Predict classes in the test data
lda.pred <- predict(lda.fit, newdata = test)
# Do a confusion matrix
tab <- table(correct = facit, predicted = lda.pred$class)
tab
## predicted
## correct low med_low med_high high
## low 15 9 0 0
## med_low 5 13 8 0
## med_high 1 10 19 1
## high 0 0 0 21
nrow(test)
## [1] 102
and here’s the same table visualized as an image:
image(tab)
fig. 6.1 Confusion matrix of the LDA fit
The confusion matrix shows that the model has found an axis that aligns very well with the crime rate quartile category. Most predictions are correct (68/102), the second most common case is being off by one class (33/102), and then off by two classes (1/102).
68/102 ≈ 67% is not perfect, but it is a lot better than choosing at random, which should give a correct prediction 25% of the time. It looks like the model can be used as a decent predictor of whether an area has a high or non-high crime rate (the lower left of the matrix vs. the top right); for that binary predictor, the correct classification rate would be 101/102, nearly 100%!
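The binary high vs. non-high claim can be checked by collapsing the confusion table. A sketch, with the counts copied from the printed output above:

```r
# Sketch: collapse the 4-class confusion table into a binary
# "high vs non-high" table (counts copied from the printed output)
tab <- matrix(c(15,  9,  0,  0,
                 5, 13,  8,  0,
                 1, 10, 19,  1,
                 0,  0,  0, 21),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

binary <- matrix(c(sum(tab[1:3, 1:3]), sum(tab[1:3, 4]),
                   sum(tab[4,  1:3]), tab[4, 4]),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(correct   = c("non-high", "high"),
                                 predicted = c("non-high", "high")))

binary
sum(diag(binary)) / sum(binary)   # binary accuracy: 101/102
```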
lda.radfit <- lda(crime ~ rad, data = train)
lda.radpred <- predict(lda.radfit, newdata = test)
radtab <- table(correct = facit, predicted = lda.radpred$class)
image(radtab)
fig. 6.2 Confusion matrix with a single variable LDA fit
Using only rad gives us a model that seems to have
exactly the same predictive power in the high vs not
high case, but loses information in the lower quartiles.
This fits with our earlier analysis of how LD1 was mostly
rad and was able to carve out most of the high
crime rate areas from the rest.
The analysis so far suggests there are at least two clear clusters in the data, so we could just choose k = 2, but let’s check with the total within cluster sum of squares what a good choice for k would be.
I will take five samples and average them, and plot the standard deviation of the twcss for each k as error bars; this should give a more reliable plot than a single sample.
# Calculate the total within cluster sum of squares for k = 1..10,
# taking 5 samples with different seeds to stabilize the variance
twcss1 <- sapply(1:10, function(k){set.seed(100); kmeans(Boston, k)$tot.withinss})
twcss2 <- sapply(1:10, function(k){set.seed(123); kmeans(Boston, k)$tot.withinss})
twcss3 <- sapply(1:10, function(k){set.seed(321); kmeans(Boston, k)$tot.withinss})
twcss4 <- sapply(1:10, function(k){set.seed(130); kmeans(Boston, k)$tot.withinss})
twcss5 <- sapply(1:10, function(k){set.seed(949); kmeans(Boston, k)$tot.withinss})
df <- as.data.frame(tibble(twcss1,twcss2,twcss3,twcss4,twcss5, k= seq(1,10)))
df <- df %>% rowwise() %>% mutate(twcss = mean(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df <- df %>% rowwise() %>% mutate(twcss_var = var(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df %>% ggplot(aes(x=k, y=twcss)) +
geom_line() +
geom_errorbar(aes(ymin=twcss-sqrt(twcss_var),ymax=twcss+sqrt(twcss_var),color="red"))+
theme(legend.position="none") +
scale_x_continuous(breaks=df$k) +
scale_y_continuous(breaks=seq(1000000,10000000,2000000))
fig. 7.1 k-means twcss plot
It does look like the plot agrees that k = 2 gives a very good fit. The k-means algorithm seems to always find the same clusters here, since the error bars collapse to nothing, indicating that the twcss is constant across seeds.
k=3 is also a potential choice, although less clearly so to me. After that, however, the variance increases greatly and each additional cluster reduces the twcss only minimally, which indicates that there isn’t a clear structure in the data to guide the clustering.
I will go with k=2, as that seems to match what we saw in the data earlier, and the clusters are very stable.
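The stability claim can be sanity-checked by comparing cluster assignments across seeds. A sketch on synthetic two-cluster data (not the Boston data), since cluster labels may swap between runs:

```r
# Sketch: check k-means stability across seeds on synthetic two-cluster data;
# labels may swap between runs, so compare assignments via a cross table
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

set.seed(10); a <- kmeans(x, centers = 2)$cluster
set.seed(99); b <- kmeans(x, centers = 2)$cluster

# A stable clustering yields a permutation table:
# exactly one non-zero cell per row and column
table(a, b)
```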
Boston_kmeans <- Boston %>% scale() %>% as.data.frame()
set.seed(101)
# k-means clustering
k2m <- kmeans(Boston_kmeans, centers = 2)
summary(k2m)
## Length Class Mode
## cluster 506 -none- numeric
## centers 28 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k2m$centers
## crim zn indus chas nox rm
## 1 -0.3894158 0.2621323 -0.6146857 0.002908943 -0.5823397 0.2446705
## 2 0.7238295 -0.4872402 1.1425514 -0.005407018 1.0824279 -0.4547830
## age dis rad tax ptratio black lstat
## 1 -0.4331555 0.4540421 -0.5828749 -0.6291043 -0.2943707 0.3282754 -0.4530491
## 2 0.8051309 -0.8439539 1.0834228 1.1693521 0.5471636 -0.6101842 0.8421083
## medv
## 1 0.3532917
## 2 -0.6566834
It seems that even this k = 2 clustering has found the high crime rate cluster. Let’s confirm that with a visualization.
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k2m$cluster), fill=factor(k2m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
fig. 7.2 k-means pairs (k = 2)
It seems that the model has picked two clusters with the following relative positions: a blue cluster with a lot of business activity (high tax rate, close to employment centers), with access to arterial highways and high-density building (lower number of rooms, zoned for smaller plots); and a red cluster of areas with less business activity, in relatively quiet regions with longer commutes and more working-class or non-educated people, and a much higher proportion of black residents.
So it seems like the red regions are newer developments, suburbs
farther away from the city, and there may be some price discrimination
in the house prices (medv) connected with the high
proportion of black residents living there. You can see effects of
segregationist US housing policy in the data: e.g., the 1949 Housing Act
set up a framework to subsidize public housing for whites, with clauses
forbidding resale to black people, which meant that black people paid
more for housing (see e.g. Abramovitz & Smith, “The Persistence
of Residential Segregation by Race, 1940 to 2010: The Role of Federal
Housing Policy”, Families in Society, Vol. 102, Issue 1, for
more).
Why not try with k=3 for good measure? Maybe we can find additional structure in the data.
set.seed(124293)
k3m <- kmeans(Boston_kmeans, centers = 3)
summary(k3m)
## Length Class Mode
## cluster 506 -none- numeric
## centers 42 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k3m$centers
## crim zn indus chas nox rm age
## 1 -0.4076669 1.1526549 -0.9877755 -0.10115080 -0.9634859 0.7739125 -1.1241828
## 2 -0.3688324 -0.3935457 -0.1369208 0.07398993 -0.1662087 -0.1700456 0.1673019
## 3 0.8942488 -0.4872402 1.0913679 -0.01330932 1.1109351 -0.4609873 0.7828949
## dis rad tax ptratio black lstat
## 1 1.05650031 -0.5965522 -0.6837494 -0.62478941 0.3607235 -0.90904433
## 2 -0.07766431 -0.5799077 -0.5409630 -0.04596655 0.2680397 -0.05818052
## 3 -0.84882600 1.3656860 1.3895093 0.63256391 -0.7083974 0.90799414
## medv
## 1 0.84137443
## 2 -0.04811607
## 3 -0.69550394
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k3m$cluster), fill=factor(k3m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
fig. 7.3 k-means pairs (k = 3)
Here we see that the blue cluster in this plot is roughly the same as the blue cluster from the k=2 clustering. The k=2 red cluster has here been split into red and green.
The differences between the red and green clusters seem to be:
- the red cluster has higher values of zn (more big plots)
- the red cluster has a smaller proportion of older buildings
- the red cluster consists exclusively of black neighbourhoods, while the green cluster has some spread
- the green cluster is between the red and blue clusters in the proportion of laborers and uneducated adults
Maybe the red cluster matches better with those black neighbourhoods built more recently, to which the 1949 Housing Act and the Federal Housing Administration regulations applied? I don’t know for sure; more analysis would be required.